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ABSTRACT 

Motivations: High-throughput sequencing has made it possible 
to sequence DNA methylation of a whole genome at the single- 
base resolution. A sample, however, may contain a number of 
distinct methylation patterns. For instance, cells of different types 
and in different developmental stages may have different methyl- 
ation patterns. Alleles may be differentially methylated, which may 
partially explain that the large portions of epigenomes from single 
cell types are partially methylated, and may have major effects 
on transcriptional output. Approaches relying on DNA sequence 
polymorphism to identify individual patterns from a mixture of 
heterogeneous epigenomes are insufficient as methylcytosines 
occur at a much higher density than SNPs. 

Results: We have developed a mixture model-based approach for 
resolving distinct epigenomes from a heterogeneous sample. In 
particular, the model is applied to the detection of allele-specific 
methylation (ASM). The methods are tested on a synthetic methylome 
and applied to an Arabidopsis single root cell methylome. 
Contact: lqpeng@cs.ucsd.edu] 



1 INTRODUCTION 

The advancement of high-throughput sequencing has opened up 
many important areas of applications, one of which is epigenome 
sequencing. DNA methylation may repress or activate transcription, 
and is known to be involved in embryogenesis, genomic imprinting 
and tu morigenesis i n mammals, and transposon silenci ng in 
plants ( Besto rl l2000l: iLi et all 1 1991 llippman et all l2004l: Rhee 



et a/.. l2002l:]zhang et all 120061 : (zilberman et all l2007l) . To 
understand the regulation and dynamics of DNA methylation, the 
locations of the modified cytosines need to be identified. The first 
single-base resolution mappings of DNA methy lation were produced 
for the who l e Ara bidopsis thaliana genome (iCokus et all 120081 ; 
ILister et "all 120081) and for sel ected subsets of sites in the mouse 
genome ( Meissner et all 120081) using various bisulfite sequencing 
technologies. DNA methylation is the modification of DNA base 
cytosine (methylcytosine). A map of DNA methylation at single-base 
resolution is referred to as methylome or epigenome. 

Cells of different types and in different developmental stages 
may have different methylation patterns. It has been observed 
that large p ortions of th e Arab idopsis methylomes are partially 
methylated ( ILister et all 120081) . This may be as a result of the 
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sample containing a number of distinct methylomes. Aberrant 
methylation is also a general feature of cancer genomes. A better 
understanding of methylation patterns in cancer genomes may lead 
to both new diagnostic markers and therapies based on the de tection 
of m ethylation changes occurring early in tumorigenesis (iLairdl 
120031) . In a tumor tissue particularly of an early stage, however, 
cancerous cells and normal cells are often mixed together. DNA 
methylation patterns can also act as markers for tracing stem cell 
expansion and tumor grow th (iKim et a/1 120051 ; IShibata and Tavaret 
120061 : lYatabe et q/.LboOlh . Making use of methylation patterns in 
this way requires determining methylation patterns associated with 
individual cells or cells from the same clone. 

When comparing human fibroblast cell IMR90 and HI embryonic 
stem cell (ESC) lines, it is o bserved that IMR90 has a lower level 
of methylation than HI ESC (ILister et q/.ll2009h . Both IMR90 and 
HI are of a single cell type. While it is expected that large portions 
(80%) of the X chromosome are partially methylated as the IMR90 
cell line is from a female and the DNA methylati on is known to play 
an important role in X chromosome inactivation jRiggslll975h . it is 
unexpectedly observed that around 38% of IMR90 autosomes are 
identified as partially methylated domains (PMD). What is the nature 
of partial methylations in a single cell type? Might allelic differences 
contribute to the partial methylations? Answering these questions 
requires detecting allele-specific methylation (ASM) patterns. 

Th e methylcytosine i s som etimes referred to as the fifth DNA 
base (ILister and Ecked. l2009h . Applying methods for detection of 
single nucleotide polymorphism (SNP) to methylation, however, 
may present difficulties. Methylcytosine is much more dynamic than 
nucleotides and observations generally suggest that methylation of 
a cytosine site is a statistical event. Unlike SNPs where a nucleotide 
occurs at a rate of 0, 50 or 100% in a diploid individual (if sequencing 
errors may be ignored), the methylation level at a particular site 
may fall anywhere in the range from 0% to 100%. Whether a 
methylcytosine is allele-specific therefore cannot be determined by 
the site alone. It needs supporting evidence from the neighboring 
nucleotides. 

If a partially methylated cytosine is in the close vicinity of a 
SNP such that reads are long enough to cover both sites, then 
it is straightforward to determine from which allele the reads are 
originated thus determining the methylation level of the respective 
alleles. Some studies have shown that ASM is as sociated with 
SNPs (iKerkel et all 120081 : IShoemaker et all 120101) . The density 
of methylation across the whole genome, however, is much higher 
than that of DNA sequence polymorphism. For instance, while there 
are close to 3 million SNPs discovered in the human genome ~62 
million and 45 million methylcytosines were detected in HI and 
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IMR90 cells JLister et oli 120091) . It has also been observed that 
changes in cytosine methylation occur at a frequency much greater 
than that of the DN A sequence mutations dOssowski et fl/lEoiOt 
ISchmitz et a/.Ll201lb . As a result, SNPs are absent in large portions of 
the methylomes. We will describe a method in this article that detects 
ASM without the assistance of SNPs. In addition, even though the 
functionalities of ASM are not w ell understood except that th ey play 
an i mportant role in imprinting jHellman and ChessL 120071: Kerkel 



et al, 120081) , it seems that the methylation level of an individual 
cytosine is less important than the overall levels of methylations 
within a region, which is also in contrast to the SNPs. We, therefore, 
focus our effort in detecting regions of ASM. 

The remainder of the article is organized as follows. A mixture 
model is described in Section [2] for modeling the outcome of a 
methylation sequencing experiment where the sample may contain 
a mixture of heterogeneous epigenomes. It aims at predicting 
methylation levels for each cytosine in each individual epigenome. 
Section |3] lays out the details for detecting regions of ASM based 
on the mixture model and validates the methods on a synthetic 
methylome. The methods are then applied to an Arabidopsis root 
cell methylome and the results are listed in Section [4] Section [5] 
discusses the results and offers some future directions. 

2 A MIXTURE MODEL FOR HETEROGENEOUS 
EPIGENOMES 

In bisulfite sequencing experiments, DNA fragments are treated 
with sodium bisulfite. The process converts unmethylated cytosines 
into uracils. The sequence of nucleotides (reads) in the converted 
fragments are subsequently determined by a sequencer. The reads 
produced by the sequencer are aligned to a reference genome. 
Usually only uniquely mapped reads are retained. As a result, what 
we have is a set of reads that are most similar in sequence to their 
respective mapped locations in the reference genome, which are 
presumably the genomic origins of the fragments that produced the 
reads. In addition, each cytosine on every mapped read is labeled as 
either methylated or unmethylated. 

The methylation level of a particular cytosine is computed as 
follows: if there are x reads that map to the position, and y out of the 
x reads have at this position a methylcytosine, then the methylation 
level is y/x. Note that if the read depth at a cytosine position is 
below a certain threshold, which is determined by the allowed false 
positi ve rate, the methylation is not called, i.e. y = 0. jLister et a/1 
l2QQ8h 

If the original sample is composed of a mixture of epigenomes, 
be it from a set of different cell types, tissues or alleles, the mapped 
reads will reflect the mixture. Our goal is to infer the original makeup 
of the mixture from the mapped reads. It should be noted that the 
attainment of the goal depends on whether the original epigenomes 
are sufficiently heterogeneous so that we may distinguish them. 

As we are only concerned with methylation, we restrict the reads 
to the genomic positions where the methylation level is greater 
than zero. The epigenomes and reads may be represented as binary 
strings, where methylcytosine is set to 1 and the remainder to 0. 
Let R be a set of binary strings, which we assume are the reads 
produced by a bisulfite sequencing experiment further restricted to 
methylation sites. For string reR, let x n - be the letter appearing 
at position i from r; let [r fl ,r^] be the positions that r spans. Let 
C = {cj\j= !•••£} be the set of natural frequencies of epigenomes, 



where cj is the frequency of the y'-th epigenome, and k is the total 
number of epigenomes. When the model is used to detect the ASM of 
a diploid organism, k equals to 2. Let M = {mij \ i = 1 • • -n J = 1 • • • k], 
where m/y is the probability of methylation of epigenome j at position 
i, and n is the length of the epigenome. The probability of observing 
string r is 

k 

P(r)= j2c jPrj , 

7=1 

where p r j is the probability that string r originates from epigenome 
j, and 

Prj= (mijx r i+(l-mij)(l-x ri )). (1) 
i=r a 

The probability of observing the set R is therefore 
P(M,C,R)=Y\P(r), 

reR 

or equivalently the log likelihood 

l(M,C,R) = ^logP(r). 
reR 

The optimization goal is to determine parameters C and M such 
that the probability of observing the set R is maximum, thus 
best explaining the reads. We estimate array C and matrix M by 
maximizing the likelihood /, 

argmax M c ^logP(r), 
reR 

which can be solved by using the expectation-maximization (EM) 
algorithm jPempster et all Il977h . First, we define a membership 
matrix 



A = {a rj \reRJ=l...k}, 



where 



1 if re;, 
0 ifr£/\ 



As the membership of string r with respect to epigenome j is 
unknown, it is estimated by its expected value as 

c jPrj 



HfCfPrf' 



(2) 



The likelihood can then be rewritten as 



l(M ,C,R,A) = J2J2 a rJ l °S( c jPrj)' 
reRj=l 



(3) 



and the optimization becomes 



k 

argmax M c J2J2 ar J Xo ^ c jPrj)- 
reRj=\ 

Solving the maximization constrained by J2j=\ c j = ^ yields the 
update equations at each M-step iteration of the EM algorithm as 
follows (see Appendix A for the detailed derivation), 



Er fl r 



rf 



^2r a rj x ri 



When the algorithm converges, the matrix M contains the predicted 
methylation levels for each epigenome in the mixture. 
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3 DETECTION OF ASM 

In this section, we describe how the model is used to detect 
allele- specific methylated regions in a diploid organism. Notice 
that ASM is not a precisely defined term. It generally refers to a 
significant difference between the methylation levels of the two 
alleles. First, the methylome of a diploid individual is scanned 
for partially methylated regions (PMRs) as candidates for further 
analysis. Second, for each candidate region, the reads that align 
to this region are computationally assigned to the two alleles and 
the methylation levels of individual cytosines from each allele are 
estimated. Last, regions are classified to allele- specific or non- 
specific methylated regions, based on read assignments and the 
predicted methylation levels from the previous step. A synthetic 
methylome is used to test the model and to illustrate the details of 
each step. 

We remark that determining whether a read along with its 
methylcy to sines has a higher probability to originate from one 
allele or the other relies on the differences between the reads, i.e. 
the methylation states of the cytosines on the reads. The density 
of methylcytosines of a genome relative to the read length in a 
sequencing experiment is therefore critical. For instance, if, on 
average, a read covers at most one methylcytosine, then there 
is very little hope to deconvolve the allelic methylation states 
without additional information. While anticipating the rapid growth 
of the read length in high-throughput sequencing technology, we 
first tested our method on Arabidopsis thaliana, which has a 
reasonably high methylation density. The median genomic distance, 
for instance, between consecutive methylcytosines on Chromosome 
1 of A. thaliana Col-0 is 15, while the typical read length of an 
Illumina sequencer is between 100 and 150 bp presently. 

3.1 Identify PMRs as candidates 

To detect ASM regions, the whole methylome is scanned for PMRs 
as candidates, as there is obviously not much difference between 
the two alleles if the methylation level of a region is near nil 
or complete. A contiguous methylated region (CMR) refers to 
a genomic region where the genomic distance between any two 
consecutive methylcytosines is no larger than a separation threshold 
s, which is set at around a width comparable to the read length. 
Each CMR is scanned with a fixed-width sliding window where 
the window width is the number of methylcytosines. A fixed- width 
window inside a CMR is classified as a PMR if n o fewer than 90% o f 
the methylcytosines are at most 70% methylated (ILister et a/.Ll2Q09h . 
A PMR may be called with or without a specific lower bound for 
methylation levels. In the data we have analyzed, the average level 
of methylations in a CMR is at least 25%, due to that (i) the region 
has contiguous methylcytosin es; (ii) the allowed f alse positive rate 
for calling a methylcytosine (iLister et all 120081) in combination 
with the sequencing coverage dictates an implicit lower bound on 
the methylation levels. Consecutive PMRs within the same CMR 
are merged into a single region. 

The dataset being tested is a synthetic methylome that is made up 
by combining the reads of two methylomes from Arabidopsis root 
cells: epidermis (Wer+) and endodermis (Scr+). The root cells are 
obtained by flow sorting; their m ethylomes by MethylC-Seq bisulfite 
sequencing (ILister et a/ll2008l) . The genomic length of the reads is 
83 bp. The reads from one cell type in the mixture are treated as 
if they are from one allele of the synthetic methylome. The ASM 



Table 1. Samples of partially methylated regions 
for classification 





>20 mC 


>35 mC 


>50 mC 


Samples 


CS 


NS 


CS 


NS 


CS NS 


Training 


255 


255 


42 


44 


10 10 


Testing 


100 


100 


19 


21 


6 6 



mc: methylcytosine; CS: cell-specific; NS: non-specific 



of the synthetic methylome is, therefore, simulated by cell-specific 
methylation in the mixture. Forward strand and reverse strand are 
processed separately. 

As ASM and cell- specific methylation arise from different 
biological processes, the patterns of differential methylations 
might, therefore, carry signatures unique to each type. The 
aforementioned synthetic methylome is appropriate for testing cell- 
specific methylations; yet whether it is a good surrogate for testing 
ASMs may be questioned. We argue that since the classification is 
largely based on overall methylation levels within a region rather 
than relations among methylation levels of individual cytosine 
sites, the criteria similarly apply to both cell-specific methylation 
and ASM. 

The separation threshold s is set to 100 bp. Each CMR is scanned 
with a sliding window of width 20, 35 and 50 methylcytosines 
respectively. Each PMR in the synthetic methylome is labeled as 
either cell-specific (allele- specific), if no fewer than 90% cytosines 
in the region are methylated in only one of the two cells, or non- 
specific otherwise. For the purpose of learning, consecutive windows 
within the same CMR are merged into one region only if they are 
labeled as the same type. For a sliding window of width 20, both 
cell- specific regions and non-specific regions have a median width 
of 23 methylcytosines. All of the cell- specific regions and around the 
same number of randomly selected non-specific regions are kept as 
samples for further analysis. The samples are divided into training 
and testing samples (Table [TJ- 

3.2 Predict allelic methylation levels 

For each PMR, the algorithm described in Section|2]is applied with 
k = 2. For each sample of the synthetic methylome, the reads from 
the two individual methylomes are mixed together. 

One potential problem with the EM algorithm is that it may 
converge to a local optimum. There are various ways to initialize 
the parameters. One initialization is to randomly assign each read 
to a cluster, i.e. set a r j = \ at probability \/k. Another option is 
to set cj = l/k, and then randomize matrix M. A third option is 
to randomize the membership matrix A, which appears to be the 
best option after testing with simulated data. We run the algorithm 
L times, each with a new random matrix A as the initialization. 
Let A 1 , A 2 , . . . , A L be the membership matrix when each individual 
run converges, from which two new initializations are derived: 
A L+1 and A L+2 , for two additional runsQ They are defined as 



1 A subtlety here concerns the ordering of the clusters at the end of each run. 
The details are omitted. 



i165 



Q.Peng and J.R.Ecker 




Fig. 1. Experimental (a and b) versus predicted (c and d) methylation levels of the synthetic methylome. Experimental methylation levels from Wer+ are 
shown in circles, Scr+ in stars. Two symbols in the bottom figures (c and d) indicate two predicted individual methylomes for the synthetic methylome. The 
left (a and c) is of an allele-specific region: chromosome 1: [7313026, 7313482] on forward strand. The region contains 50 methylcytosines (mC); 160 reads 
are aligned to this region. The right (b and d) is of a non-specific region: chromosome 3: [12698733, 12699133] on forward strand. The region contains 42 
mC; 367 reads are aligned to the region 



follows: for reR,j=l...k, 
i=l 

a L+2 f 1 iO" = argmax| /=1 (^+ 1 ) 
rj I 0 otherwise. 

Of the total L + 2 runs, the one that converges to the largest likelihood 
is selected. The matrix M yields a predicted level of methylations at 
each cytosine site. Figure[T]illustrates two samples of PMRs from the 
synthetic methylome: one cell- specific and the other non-specific. 
Both experimental and predicted methylation states are shown. The 
experimental data are from the individual methylomes and are thus 
treated as golden. The predicted methylation states are used for 
further classification. 

3.3 Classify candidate regions with a support vector 
machine classifier 

Once the methylation levels of individual methylcytosines in each 
allele are estimated and the membership of reads predicted by the 
model, what remains is to characterize and classify each region based 
on the estimations. Recall that we used a rather simplistic rule to 
automatically label the samples in the synthetic methylome when 



preparing the training and testing samples. The labels are derived 
from the knowledge of the two individual methylomes making 
up the synthetic methylome, and therefore are independent of the 
predictions made by the mixture model. One option is to use the 
same rule to classify a region from predicted methylation levels. We 
hope to capture more characteristics of the two classes, however, so 
multiple measures are employed for this task when it is applied to 
the real epigenome. We also use the synthetic methylome to train a 
support vector machine (SVM) classifier that is later used for a real 
single cell methylome. Notice that the value a r j gives the probability 
that the read r is from allele j. If a read needs to be assigned to a 
single allele, it should be assigned to the j that has the larger value 
of a r j, or formally, to cluster j r = argmaxj(a r j). 

The features for SVM are extracted from the estimated 
methylation levels (M) and allele frequencies (C). A total of nine 
features are used (details omitted due to exigencies of space). Both 
linear and radial basis function (RBF) kernels are tested; the latter 
yield better performance. Five-fold cross-validation is used to select 
the best parameters for the kernel function from each training set. 
The testing results are shown in Table[2] In some of the false positive 
(FP) samples, both cells have completely unmethylated reads and 
these reads are clustered together in the synthetic methylome, which 
are then classified as allele-specific (cell- specific). We hypothesize 
that these regions have ASM in both individual cell types. 
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Table 2. Testing results for classification of partially methylated 
regions 



>20 mC >35 mC >50 mC 



Samples 


CS NS 


CS 


NS 


CS 


NS 


Predicted CS 


88 (TP) 13 (FP) 


17 


3 


5 


1 


Predicted NS 


12 (FN) 87 (TN) 


2 


18 


1 


5 


Accuracy 


87.5% 


87.5% 


83.3 


% 



CS: cell-specific; NS: non-specific 



Table 3. Classes based on differential averaged 
methylation 



d 


Class name 




Label 


[0.0,0.2] 
(0.2,0.9) 
[0.9,1.0] 


Similarly methylated d s 
Moderately differentially methylated d m 
Highly differentially methylated <4 


Table 4. Predicted ASM regions 


Method 


svm 


Imr dmr 


mww 


Intersection 


Regions 


277 
/ 
/ 


39 d h : 19 
/ d m : 368 

d m 


362 

/ 

/ 


IS* 1 * 
16( 2 ) 
192( 3 ) 



3.4 Identify ASM regions with multiple filters 

Using an SVM classifier has its limitations. A classifier trained on 
one organism will certainly not be appropriate for other organisms. 
A classifier trained on one cell type may be questionable for another 
very different type. When parameters for the sequencing experiment 
change, the classifier should be retrained. In addition, there may not 
be sufficient data samples for learning at all. Additional measures 
are therefore necessary for real data. 

The Mann-Whitney-Wilcoxon (mww) test is performed on the 
two clusters resulting from the mixture model. If the null hypothesis 
is rejected, one cannot readily claim that the methylations are allele- 
specific. But if on the other hand the null hypothesis is not rejected, 
it is unlikely that the methylations are allele- specific. Two additional 
filters are based on methylation rates. One filter is called low 
methylation rate (Imr). It computes whether one of the two clusters 
is < 10% methylated at > 80% sites. The other is called differential 
averaged methylation rate (dmr). The averaged methylation rate for 
each cluster in the region is defined as 

1 A mij 

r; = > , 

n ~[ m i\ + m i2 

and the differential averaged methylation as 

d = \n-r 2 \. 

We define three classes based on this measure shown in Table [3] 



4 DETECT ASM REGIONS FOR ARABIDOPSIS 
EPIDERMIS METHYLOME 

The methods are applied to the methylome of a single cell type, 
a root epidermis (Wer+) cell from A. thaliana Col-0. There are a 
total of 452 partially methylated regions as defined in Section [TT] 
The total genomic length of the regions is 84 397 bp, covering 9980 
methylcytosines. The 22926 reads are aligned to these regions. 

Once the methylation levels of the two alleles and the membership 
of all reads are predicted by the models, the results are subject to 
four methods for classification as mentioned in Sections l331 and l34l 
Recall that the SVM model is trained on the synthetic methylome 
made up of two Arabidopsis root methylomes, Wer+ being one of 
them. Table |4] shows the number of PMRs predicted to be allele- 
specific by each method and the intersection is taken so as to obtain 
the most conservative predictions. The first two rows of the table 
reflect the intersections taken between the d^ class and all other 
criteria, and d m class and other criteria, respectively, totaling 34 
regions. Table \5\ lists the details and annotations for group (1), and 
Table [6] for group (2). Based on the predicted methylation levels, 
both groups are highly allele- specific. While one of the two alleles 
has nearly no methylation, in the first group of 18 regions, the other 
allele is in general highly methylated; and in the second group 
of 16 regions, the methylated allele is more partially methylated. 
An example from each group is shown in Figure |2] If the criteria 
for ASM are relaxed a bit by removing the rather stringent filter 
Imr, many more regions [the last row (3) in Table 0) are admitted 

Table 5. Predicted ASM regions, group (1) in Table|4] 



ch Coord start Coord end No. of mC str Gene model 



2 
2 

2 

3 
3 

4 
4 

4 



4 
5 

5 
5 

5 



ch: chromosome; str: strand; <: upstream; >: downstream; Signs in the last column 
indicate opposite strands. 



6173395 


6173517 


20 


+ 


< AT1G17940 - 


17295458 


17295563 


24 


+ 


AT1TE57315 - 


17824930 


17825022 


20 


+ 


AT1TE59180 - 


18450185 


18450342 


20 




AT1TE61145 


21929675 


21929885 


21 


+ 


> AT1G59660 










< AT1G59670 


7089119 


7089281 


21 




AT2G16380 3'UTR + 


7340522 


7340662 


23 


+ 


AT2TE29970 










< AT2G16930 


14386759 


14387012 


23 


+ 


> AT2G34060 










> AT2G34070 - 


12092053 


12092151 


20 


+ 


AT3TE50300 - 


16726743 


16726938 


29 




< AT3G45570 










AT3TE67795 + 


2367251 


2367352 


20 


+ 


< AT4G04670 


10912792 


10913005 


25 


+ 


> AT4G20210 - 










AT4TE50030 - 


11932172 


11932467 


24 


+ 


AT4TE55215 - 










> AT4G22690 










< AT4G22700 


13269127 


13269376 


28 




AT4TE62330 


5128508 


5128716 


22 




AT5TE18530 










> AT5G15725 + 


8215115 


8215303 


24 


+ 


AT5TE29690 - 


22267311 


22267477 


25 




AT5TE80145 + 










< AT5G54810 


26117272 


26117538 


20 


+ 


AT5TE94040 
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Table 6. Predicted ASM regions, group (2) in Table |4] 



ch 


Coord start 


Coord end 


No. of mC 


str 


Gene model 


1 


18054504 


18054609 


20 




< AT1G48820 + 












< AT1G48810 


1 


21249227 


21249472 


20 




AT1TE70195 - 












AT1TE70200 — 


1 


29696108 


29696354 


26 


+ 


< AT1G78960 


2 


2168534 


2168695 


21 




AT2TE09960 












> AT2G05752 


2 


11844725 


11844944 


20 




AT2TE51520 












> AT2G27780 


2 


11844762 


11845016 


25 




AT2TE51520 












> AT2G27780 


3 


10865759 


10866031 


21 




> AT3TE45185 - 


3 


12092064 


12092269 


25 


+ 


AT3TE50300 - 


3 


16266681 


16266822 


21 


+ 


AT3TE65915 - 












< AT3G44718 - 


3 


16934173 


16934369 


20 




AT3TE68630 + 












> AT3G46110 + 


4 


4097347 


4097458 


20 


+ 


> AT4TE 17760 


4 


4547945 


4548048 


39 




AT4TE19110 












< AT4G07747 


5 


13812896 


13813035 


21 


+ 


AT5TE49235 - 


5 


15205421 


15205719 


23 


+ 


AT5TE55020 


5 


15268332 


15268498 


21 




AT5TE55235 












< AT5G38220 + 


5 


17406449 


17406557 


22 


+ 


AT5TE62865 - 



ch: chromosome; str: strand; <: upstream; >: downstream; Signs in the last column indicate opposite strands. 



(a) 40 



(b)25 r 



30 



~S 20 



10 



10 



15 
mC 



20 



25 



(c) 



J> 0.3 

1 0.6 
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Fig. 2. Read assignments (a and b) and methylation levels (c and d) of the predicted ASM regions of Arabidopsis Wer+ cell. Circle and star symbols in 
c and d represent two alleles. Lines in top (a and b) figures are restricted reads (solid and dotted lines represent two alleles); small diamonds on lines are 
methylations. Left (a and c) : a region of Tablegjl): chrom 4: [13269127, 13269376] - strand. Right (b and d) : a region of Tabled): chrom 5: [17406449, 
17406557] + strand 
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Table 8. GO annotation summary for protein coding genes in Table [7] 



Functional category 



Gene body Upstream Downstream 



Fig. 3. An example from Table|4]group (3): region: chrom 1: [77393, 77508] 
— strand, 20 mC. (a) reads assignments, (b) predicted methylation levels 



Table 7. Annotation summary of all ASM regions in Table |4] 



Annotation 


Overlap 
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Downstream 


Protein coding gene 


4 (4) exon 


50 (24) 
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mi, t, other RNA 
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Transposon (only) 


66 (39) 







Numbers in parenthesis are on the opposite strands. Upstream and 
downstream are within lkb. 



for further examination. An example from this group is shown in 
Figure [3] 

Table [7] summarizes the TAIR9 annotations (ITAIRL 1200^) for all 
three groups of a total of 226 predicated ASM regions. Many regions 
overlap with natural transposons as expected; the last row in the 
table reports the number of regions that have no other annotation 
than transposon. For the protein coding genes, the gene ontology 
annotations are summarized in Table [8] One region from group (3), 
on forward strand of chromosome 1: 11267775 - 11268327, is on 
the antisense of an exon of protein coding gen e AT3G29360, an 
imprin ted gene in A. thaliana seed reported by iMcKeown et al\ 
(l201ll) . The region has 26 methylcytosines. 
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5 CONCLUSION AND DISCUSSION 

We have described a computational model for resolving distinct 
epigenomes from a heterogeneous sample. In particular, we applied 
this model to identify allele- specific methylated regions. The reads 
from potentially multiple epigenomes are mapped to a common 
reference genome. The goal is essentially to infer the distinct 
methylation patterns from the mapped reads. Our approach is 
different from previous attempts in that it does not rely on 
SNPs, which are few and far between when compared with 
methylcytosines. 

The model was tested on a synthetic methylome. The classification 
based on the mixture model in conjunction with an SVM classifier 
yielded an accuracy of 87.5%. Even though the SVM approach is 
not always applicable to real methylomes, since all the features 
are derived from the predicted methylation levels and cluster 
frequencies, the test results reflect the reliability of the predictions 
made by the model. Additional multiple filters for ASMs may further 
reduce the number of false positives. 

Additional approaches may be employed to validate the methods 
for ASM detection. One approach is to use ASMs determined with 
SNP data as ground truth, although such data are presently sparse 
and anecdotal. Another approach to validate ASM is to use heritable 
epi-alleles in combination with phasing information obtained from 
crossing of plants. 
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We applied the methods to regions of the genome with relatively 
high density of methylcytosines with each region being treated 
independently. By taking advantage of pair-end reads and other 
information, it will also be possible to do phasing and extend 
and connect the regions. Our model assumes that methylations are 
independent of each other. Methylations in some regions have a 
tendency to occur in clusters, which indicates a certain dependency. 
While our model gives a reasonably good first-order approximation, 
Markov chain-based models perhaps may be explored to take the 
dependency into consideration. Identifying ASM is still only at 
initial research stages. Another direction for future research is to 
focus on understanding the functionalities of ASM, for instance, 
how they are related to allele- specific expression. 

More complicated heterogeneous epigenome samples may arise 
from a mixture of various cell types, or a mixture of cancerous cells 
at various stages, which present yet more and greater challenges than 
the allelic methylations of a diploid cell. Such samples will enable 
an ultimate test for the power of the methods. The initial steps are 
to develop scenarios and criteria for validation as it becomes less 
obvious what defines cell- specific methylations in the context of 
multiple cell types. 
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APPENDIX A 

Section [2] described the model for a mixture of epigenomes. This 
section details the derivation of the update equations at each M-step 
iteration for the EM algorithm. 

Recall from Section [2] that the optimization goal is to determine 
the methylation probability matrix M and the epigenome frequency 
array C such that the likelihood l(M ,C,R,A) as defined by 
Equation ^ is maximized given the set of observed restricted reads 
R. The membership matrix A is estimated by Equation $2% at each 
iteration. 

As the maximization of likelihood l(M , C,R,A) is constrained by 
S/=i9 = l, i- e - me epigenome frequencies sum up to 1, we will 
introduce Lagrange multiplier X and maximize the unconstrained 
function 

k / k \ 

l(M,C,R,A^) = J2J2 a rj l °B(cjP rj )-X J>-1 • 
reRj=l \j=l ) 

Substituting in p r j given by Equation (OJ, we have 

* /* \ 

l=^2^2(a rj logp r j+a rj logCj)-Xl J2 c J~ l 
reRj=l V=l / 

k r b 

= ^2^2^2 a r j log (mijXji + (1 - Mij )(1 -Xri)) 
reRj=li=r a 

reRj=l \j=l / 

k r b 

= Y^y^ a rj (x r ilogmij + (l -x n )log(l -mij)) 
reRj=li=r a 

k I k \ 

reRj=l V=l / 
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Maximizing / with respect to M , C, k by setting the respective partial 

derivatives to zero yields the following set of equations ^ 

/ \ — =° 

dl n v-^ / x r j 1 — x r j \ dk 

dmtj V \ m ij l ~ m ij 



^2a r j(x ri -mij) = 0, 



C > k 



r j 



Solving for m/y and substituting in k for cj lead to the update 
ol ^ a ^L—x = 0 equations at each M-step iteration of the EM algorithm as follows, 
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